Taxonomic identification accuracy from BOLD and GenBank databases using over a thousand insect DNA barcodes from Colombia

Recent declines of insect populations at high rates have resulted in the need to develop a quick method to determine their diversity and to process massive data for the identification of species of highly diverse groups. A short sequence of DNA from COI is widely used for insect identification by comparing it against sequences of known species. Repositories of sequences are available online with tools that facilitate matching of the sequences of interest to a known individual. However, the performance of these tools can differ. Here we aim to assess the accuracy in identification of insect taxonomic categories from two repositories, BOLD Systems and GenBank. This was done by comparing the sequence matches between the taxonomist identification and the suggested identification from the platforms. We used 1,160 COI sequences representing eight orders of insects from Colombia. After the comparison, we reanalyzed the results from a representative subset of the data from the subfamily Scarabaeinae (Coleoptera). Overall, BOLD systems outperformed GenBank, and the performance of both engines differed by orders and other taxonomic categories (species, genus and family). Higher rates of accurate identification were obtained at family and genus levels. The accuracy was higher in BOLD for the order Coleoptera at family level, for Coleoptera and Lepidoptera at genus and species level. Other orders performed similarly in both repositories. Moreover, the Scarabaeinae subset showed that species were correctly identified only when BOLD match percentage was above 93.4% and a total of 85% of the samples were correctly assigned to a taxonomic category. These results accentuate the great potential of the identification engines to place insects accurately into their respective taxonomic categories based on DNA barcodes and highlight the reliability of BOLD Systems for insect identification in the absence of a large reference database for a highly diverse country.


Introduction
More than a million species of plants and animals are in danger of extinction according to the latest global assessment of the intergovernmental platform on biological diversity and ecosystem services [1]. Declines in biodiversity and projections of a sixth mass extinction [2] have placed the impact of such losses on the agendas of governments all over the world [3]. Although insects constitute one of the most diverse groups on the planet, many species remain undiscovered and recent evidence suggests the decline of their populations at high rates [4,5]. To date, the risk of extinction of 12,161 species of insects has been evaluated according to the criteria of the IUCN red list and 2,291 species are considered "threatened" around the world [6].
Colombia is known as a biodiversity hotspot in the world. Although insects are the most studied class of animals in the country [7], research involving their genetic diversity is still currently underrepresented. Countries around the world are implementing massive sampling of insects with barcode generation under the premise that barcodes will allow us to assess biodiversity at a higher speed [8]. Barcodes in taxonomic works are showing an incredibly hidden diversity of insects with descriptions of hundreds of new species facilitated by DNA, morphology, and natural history [9,10]. As a megadiverse country, Colombia has great challenges to complete the inventory of its biodiversity and then the sustainable use of its resources. In particular, accelerating species identification using an integrative taxonomy is a priority.
Assessing biodiversity via sampling and DNA barcoding has been successfully applied in areas where coordinated sampling and barcoding facilitated the creation of extensive libraries for a particular region. An example of such collaboration is described in Morinière et al. [8] where the Bavarian State Collection of Zoology (ZSM), the Barcoding Fauna Bavarica (BFB), and the German Barcode of Life (GBOL) project generated a library of 120,000 reliably identified species in Coleoptera, Diptera, Hymenoptera, and Lepidoptera for Germany and Central Europe. A closer example in the Neotropics is Costa Rica which has been leading the Bioalfa project and within the last 15 years has generated 45,000 insect species barcodes recorded [11]. Added to this initiative are other countries across the globe such as Canada, Ecuador, Sweden, and Singapore [9,[12][13][14][15][16]. Moreover, the development and implementation of effective methods for species monitoring are important for assessing biodiversity, and overall health of ecosystems [8]. The species richness of insects in terrestrial ecosystems facilitates their use as bioindicators for measuring biodiversity and tracking the effects of changes in environmental conditions. The challenges faced by biodiversity declines have resulted in a need to generate new strategies for monitoring that are capable of generating vast amounts of data in a rapid and cost-effective manner. One strategy that has been gaining traction in recent years has been the implementation of DNA barcoding.
Barcode sequences are generated from a standardized region that consists of short DNA sequences (generally between~300 to~700 bp). The mitochondrial cytochrome c oxidase subunit I (COI) is most often used for insect barcoding [17][18][19]. A DNA barcoding repository is built from sampling the DNA of individuals that are retained as vouchers, from which a photograph is taken. In the case of insects, normally a leg is removed from the specimen and used to extract the DNA. In some cases, whole bodies of the samples are used when the genetic material available is limited by the size of the specimen. Although non-destructive DNA extraction methods are preferred to preserve the sample's morphology for future reference, photographs are also suggested for facilitating the identification before this step. This approach requires the creation of a large DNA reference database where the identification of unknown specimens is achieved through the information confirmed and curated in the database. The idea behind is that unknown samples could be identified by matching their sequences to the species that are  [20] is a platform specifically developed for this type of information. The BOLD platform provides the identifications match based on BINs (Barcode Index Numbers) [21] that are automatically generated through algorithms for the specimens. Although unidentified species coded in BINs can help estimating diversity, the identification of species is also possible using the Basic Local Alignment Tool (BLAST) from GenBank [22,23]. BLAST allows matching sequences to a query sequence. The search consists of comparing nucleotide sequences hosted in the database and to provide statistical significance of these matches. The main goal from BLAST is not necessarily to provide species identification because the same principle can be used for the search of gene families; uses of identification databases are vast in biology. Scientists frequently use these two platforms to identify species from barcodes. An assessment of these two main public repositories from curated samples across taxa found a slightly better performance from GenBank vs BOLD for insects (n = 17) at species and genus level identification without statistical significance [24]. Later, a revision of this insect data found the results to be more similar than previously reported for this group [25]. Accurate identification of species and overall composition in ecosystems based on sampling is a key factor for conservation initiatives. Therefore, evaluating the performance of these platforms assigning specimens to a specific taxonomic category level has a significant impact on the advance of other fields.

PLOS ONE
Here, we analyzed a large inventory of over 1,000 insect DNA barcodes from Colombia which includes the first strepsipteran barcode for our country. We aim to compare the matches of the taxonomist identification to species, genus, and family level with the search engines of BOLD and BLAST service from GenBank in order to provide suggestions and next steps for the use of these tools. We assess the accuracy of the species identification engine from BOLD by using a subset of insects corresponding to Scarabaeinae subfamily (Coleoptera: Scarabaeidae).

Collection locality/geography
Insect samples were collected from ten departments of Colombia through different projects (Colombia BIO, Santander BIO). The majority of the samples (81.0%) came from Antioquia, Vichada, and Santander. The spatial data of the localities were processed in ArcGIS 10.2 (License E300 04/26/2013) (Fig 1). The elevation of the samples covered a wide range with the lowest sampling occurring at 56 masl and the highest at 3,563 masl. Although samples were collected using seven different entomological collecting methods (malaise, pitfall, light trap, hand-picked, entomological net, Van Someren-Rydon trap and folding net), nine malaise traps accounted for 62.6% of the total collected material, followed by pitfall (17.2%), and light traps (8.1%). Based on the article 2.2.2.8.1.2 decree 1076 from 2015 of the environmental sector and sustainable development of Colombia, The Instituto de Investigación de Recursos Biológicos "Alexander von Humboldt" was not required to obtain a field permit when collecting specimens from the wild that are applied in non-commercial research.

Taxa description
Our dataset included insects from the orders Coleoptera, Diptera, Hymenoptera, Lepidoptera, Hemiptera, Odonata, Psocodea and Strepsiptera, from which Coleoptera is the most represented (35%). Specimens were identified morphologically by students and by specialists from order to species level (S1 Table); additionally, for Scarabaeinae, species were identified to morphospecies level following González-Alvarado et al. [27]. Currently, few samples (7%) have been identified to lower levels from molecular data by BOLD Data Manager. Vouchers were

DNA sampling and extraction
One leg was removed from each insect, when possible, but some samples required the use of the full body to provide enough DNA. All samples were sent to the molecular laboratory of Instituto Humboldt at CIAT, Palmira. Extraction protocols were conducted following the high throughput DNA isolation from Ivanova et al. [28]. The DNA concentration was estimated by quantification using the NanoDrop 1000 (Thermo Fisher Scientific). Amplification of cytochrome c oxidase I (COI) fragments (658 bp) took place by polymerase chain reaction (PCR) with a laboratory prepared master mixture containing 2 μL of template DNA (~10-50 ng), 1X Taq buffer ((NH4)2SO4), 200 μM of each deoxynucleoside triphosphate, 2 mM MgCl 2 , 0.2 μM of each primer, 0.4 μg/μL bovine serum albumin, and 1 U of Taq DNA polymerase. A cocktail of four primers was used: LCO1490 5'-GGTCAACAAATCATAAAGATATTGG-3', HCO2198 5'-TAAACTTCAGGGTGACCAAAAAATCA-3' [29], Lep-F1 5'-ATTCAACCAATCATAAA GATATTGG-3' and Lep-R1 5'-TAAACTTCTGGATGTCCAAAAAATCA-3' [30]. Primers were selected following Canadian Centre for DNA Barcoding (CCDB) suggested primers for taxonomic groups. The PCR cycling conditions consisted of a first cycle of denaturation at 94˚C for 3 min; followed by 5 cycles of 94˚C for 30 s, 45˚C for 40 s, and 72˚C for 1 min; then 35 cycles of 94˚C for 30 s, 51˚C for 40 s, and 72˚C for 1 min; and a final extension cycle at 72˚C for 5 min. PCR products were visualized by electrophoresis on 1.5% agarose gels stained with SYBR Safe (Thermo Fisher Scientific) and using 1X TBE. The ExoSAP-IT protocol was used to clean PCR products, after which Sanger sequencing of amplicons were carried out at The Universidad Nacional de Colombia and the University of the Andes. Sequences were assembled and edited in Geneious 10.0.9 [31]. All files were uploaded in BOLD systems and are found in the dataset DS-CBIH2020; Dataset available at: doi.org/10.5883/DS-CBIH2020.

Data analysis
BOLD Batch ID engine. To identify samples to species level, we ran BOLD's Identification Engine in datasets created for insect orders in July 2020. The identification engine allows the user to select the search within the COI Species Database or COI Full Database. We selected COI Species Database which will match the sequence against an identified species as opposed to individuals without identification to species level. We kept filters as default for a minimum overlap of 300 bp and sequence length � 100 bp. We ran the analysis with 97% similarity, a percentage that accounts for within species variation and error of the sequence [8]. Under the criteria selected of 97% numerous species ID were not obtained. We decided to decrease the criteria and include matches using the 80% similarity filter due to the lack of matches at 97%. The 80% similarity is the minimum filter allowed in BOLD. A match was accepted as correct when the identification by the taxonomist was in agreement by the suggested identification by BOLD. The ID search engine provides a list of maximum 100 closest specimens, and we selected the first suggestion from the list that has the highest percentage match and base pair overlap. We included an alternative match when the suggestion was done towards a sequence from our project and ran statistical tests separately to prevent circular analysis of results. Matches were checked at species, genus, and family level.

BLAST.
We used BLAST from GenBank in Geneious 10.0.9 by default to look for matches of the sequences. The parameters included were the database search from Nucleotides collection, program set as the MegaBlast, results displayed as a Hit table, and retrieve matching region with maximum hits of 100. The search was set to provide a list of 100 closest nucleotide sequences, from which we selected the first suggestion that obtained the highest Bit-score or Max score. Most of the times, the E value was 0 (or closest to zero) and the query cover was 100%. Matches were checked at species, genus and family level.
The identification analyses in both approaches (BOLD and BLAST) followed the assumption that species included on the databases are correctly identified; however, this approach might not be accurate in cases where the most similar species were wrongly identified.
Dung beetles (Scarabaeinae) case study. The Scarabaeinae subfamily was selected due to the high representation of this group in the samples of Coleoptera (42%) and from the total insect samples (15%). In order to evaluate the accuracy of species identification suggested by BOLD, we consulted taxonomists with expertise in the identification of the Scarabaeidae family. They were asked to reanalyzed the results and mention if they agree with the species and genus identification suggested by BOLD and to explain why, as follows: Correctly identified genus, correctly identified species, probable correctly identified species, incorrect identification (Table 1). When available, specimens deposited at the IAVH-E collection and photographs uploaded to BOLD were checked for specimens to corroborate the taxonomy, and field notes from the collectors.
Statistical analysis. Chi-square analyses were performed between match results for BOLD and GenBank at the studied taxonomic levels (species, genera and family) to evaluate if there was a significant difference between the platforms performance. An additional series of Chi-square analyses were performed to identify differences in the platforms by each order of insect at the species, genus, and family level. We adjusted our significance threshold at each taxonomic level based on the number of orders compared using the Bonferroni Correction. Flagged records were removed from the analysis.
The non-parametric Kruskal-Wallis test in PAST, Version 4.05 [32] was used to determine any significant differences of medians for BOLD % match of the correct Scarabaeinae categories against incorrect identification. Assumptions for an ANOVA were not fulfilled. Dunn's post hoc was carried out to identify differences between categories.

Description of the data
A total of 1,160 COI sequences of insects from Colombia were generated since 2019 (S1 Table). From this set, 1,088 sequences were barcode compliant and placed in 708 BINS ( Table 2). Of these 708 BINS, 500 were identified as unique (Fig 2). The most frequent BINS were all in Coleoptera, BOLD:ADN3981 Paulosawaya sp. (11 specimens), BOLD:ADO6609 Paulosawaya ursina (Blanchard, 1850) (10 specimens), and BOLD:ADJ9394 Ptilodactylidae (9 specimens). From the specimens with barcode compliant 33 were flagged as problematic records in most instances due to suspected contamination, specimen mix-up or misidentification.

PLOS ONE
Taxonomic identification accuracy from BOLD and GenBank databases using insect barcodes P = 0.006). No differences were found for the other orders tested at the different taxonomic categories. Same results were found when analyses were performed with matches to data from our project, the only comparison that changed was at the genus level where no significant differences were found for Lepidoptera (S3 Table).

Case of study Scarabaeinae subfamily (Coleoptera: Scarabaeidae)
A total of 172 specimens were analyzed. The samples in the subfamily Scarabaeinae were assigned by the taxonomist to 16 genera and 27 species (S1 Table). BOLD engine with the search at 80% provided identifications for all 172 specimens analyzed (Fig 4A and S2 Table). A match in agreement with the taxonomist was found for 82 (80%) specimens out of 103 at the genus level. For species, the search resulted in 55 (78%) specimens out of 69 in total. Based on this information, these records were classified as 137 (80%) in agreement with the taxonomist and 35 (20%) in disagreement (Table 3 and Fig 4A).
Then, we corroborated de novo the identification for these 172 records of dung beetle with a taxonomist (Fig 4B). The de novo specimens were assigned to the same 16 genera and 27 species (70 specimens), in addition to 31 morphospecies (genus and code "H", 97 specimens)  (Tables 4 and S4). The new review resulted in 95 records (55%) correctly matched the taxonomist species identification or was deemed likely to be the species (Correctly identified species and probable correctly identified species categories). At the genus level, 52 records (30%) correctly matched the taxonomist genus identification but either were not accurately matched with the species or the species identification was not provided (Correctly identified genus category). There were 25 records (15%) at the genus level which did not match the identification of the taxonomist and arguments against BOLD identification were provided by the taxonomist (Incorrect identification category) (S4 Table). Therefore, the records previously classified as 137 (80%) increased to 147 (85%) in agreement with the taxonomist and were reduced from 35 (20%) to 25 (15%) in disagreement based on the BOLD search engine at 80% (Table 3 and Fig 4A).
Correctly identified species had a percentage match range in BOLD from 93.4 to 100 ( Fig  5). Probable correctly identified species ranged from 94.85 to 98.04. Correctly identified genus range was between 87.7 and 100%. Incorrect identification percentages range from 83.59 to 99.83. The categories were significantly different (Kruskal-Wallis H = 73.4, P < 0.001). The correctly identified species (P < 0.001), probable correctly identified species (P < 0.05) and correctly identified genus (P < 0.05) categories were different from the incorrect identifications.

Discussion
In our study, BOLD systems outperformed BLAST, providing more accurate insect identifications. This overall pattern was found for all taxonomic levels evaluated (species, genus and family). Although differences were found between taxa, BOLD consistently provided higher identifications for the order Coleoptera at all taxonomic levels and for Lepidoptera at genus and species level. For instance, Hymenoptera and Odonata performed similarly for both repositories at all taxonomic levels, alike most of the other orders at the family level when analyzed separately. These results are partially in disagreement with Meiklejohn et al. [24] which reported no significant differences between BOLD and GenBank for insect samples identification at genus and species level. A further review of their data showed that correct identifications from both platforms were closer than estimated previously [25]. Although we report different results in the performance of BOLD compared to GenBank in this study, a key factor to consider is the magnitude of insect data involved. Meiklejohn et al. [24], who studied 17 specimens in 12 orders at genus and species level using COI whereas we included 1,160 specimens in eight orders at family, genus and species level. BOLD Systems is a platform developed with the intention of providing species identification and in that sense does not perform equally to GenBank, but both are the main public repositories often used to provide identifications. Often BOLD is considered to contain data more curated, but GenBank data quality is good overall [25,34]. Differences in identifications of taxa can be considered due to differential taxonomic curation of the databases and the lack of barcodes of reference from Colombia. However, we could not identify a clear pattern with the lack of references. The most represented groups were Coleoptera and Lepidoptera. The group that performed the best with identifications was Coleoptera and it was also the order that provided the most unique BINS, in contrast with Lepidoptera that provided the least unique BINS also outperformed GenBank at the species and genus level, but no differences were found at family.
A great potential of Colombian insect biodiversity was found from BINs counting with 70% of the sequences assigned to a unique BIN. Each BIN represents a cluster of species or an operational taxonomic unit (OTU) that is compared against the database assigning the barcodes to known clusters (BIN) or creating a new cluster (unique BIN) [21]. Unique BINs were found across all the orders of insects and Coleoptera provided the highest count of new BINs followed by Diptera and Hymenoptera, some of the hyperdiverse orders of insects. However, Lepidoptera was the order with highest non-unique BINs. This finding is likely attributable to a great effort of DNA sampling directed to Lepidoptera that displays 4,848 barcodes from the country with 1,474 BINs [35]; in comparison to Coleoptera with 318 sequences in 82 BINs, Diptera 2,646 with 339 BINs and Hymenoptera 2,179 with 240 BINs [35].
Furthermore, our discovery of 500 unique BINs from our dataset of 1,088 sequences showcases the need for further sampling and identification in order to add more species into these databases and increase the accuracy of matches. However, this would require the implementation of a systematic method of collecting samples across Colombia, a process that could take a considerable length of time and amount of labor. Of our seven collecting methods used in this study, the malaise traps accounted for 62.6%, and we would suggest a targeted approach by placing these traps in strategic areas across Colombia. We expect that a larger sample will continue providing a higher number of unique BINs, but also, we see a necessary taxonomic effort to match all of this genetic diversity with the insect species already described from Colombia. Nevertheless, the implementation of DNA barcodes in the country will bring an opportunity to recognize any hidden diversity. As it has been suggested that BINs underestimate by a 10% the species due to lack of power within close species [11,36,37]. Although Meier et al. [38] suggested awareness on the use of BINS as identification tool specially when use for species descriptions. Therefore, an integrative approach with morphology and genetic diversity can boost the recognition of species.
A closer look to the subfamily Scarabaeinae displayed a great performance of identification at different levels. The percentage of similarity for a match (BOLD match percentage) to identify species correctly in Scarabaeinae was in a wide range between 93% to 100%. Although correct matches were statistically significant the range was not perfect, as shown at least three specimens were incorrectly identified with values higher than 94%. These outliers do not reduce the potential of genetic identification tools. The high percentages could be due to specimen mix-up. Overall, the identification success rate of Scarabaeinae was 85% (30% towards genera and 55% towards species) when combining the barcode results with a taxonomist effort (Table 3 and Fig 4A). The species identification improved from 78% to 96% (or in the overall from 30% to 55%) with this taxonomist de novo identifications (Table 3 and Fig 4A). Other studies of identification of beetles using barcodes found species matches as high as 92.1% of their samples when considered traditionally identified species and BINS and as high as 98.3% of the samples when haplotypes were considered [39]. In comparison, our identification percentage was lower with previous studies. These can be due to the lack of sequence records of the species in the database. Pentinsaari et al. [39] suggested identification failures are due to difficulty with morphological characters and controversial taxonomic status of species. Taxonomic input is crucial for the function of this tool, also if the species is not represented in the database there is no possible match to occur. In fact, a few suggestions provided by BOLD for incorrect species identifications were for close related species or species within their group of species. For example, Dichotomius andresi and Dichotomius protectus were matched with Dichotomius satanas, a morphologically related species in the "Dichotomius satanas species group". Similarly, Eurysternus hypocrita was identified as Eurysternus olivaceus, a species in the "Eurysternus velutinus species group". The specimen identified as Phanaeus pyrois was matched as Phanaeus malyi, a species that was long considered synonymous with P. pyrois. Currently, P. malyi has been revalidated and registered in Colombia without a specific location [40]. Table 3. Summary of matches between the taxonomist and BOLD identifications for the subfamily of dung beetles Scarabaeinae. The first section (I) corresponds to the initial matches performed in agreement with the taxonomist after the search engine in BOLD at 80%. The second section (II) in grey includes the categories used by the taxonomist to reevaluate the identifications provided by BOLD in (I) and the final matches in agreement.  The distribution range of species could be an additional factor as observed with the dung beetles data where BOLD identify all species correctly in six genera (Coprophanaeus, Digitonthophagus, Malagoniella, Onthophagus, Phanaeus and Sulcophanaeus), eight genera partially correct with some of the species identified correctly but not all (Ateuchus, Canthidium, Canthon, Deltochilum, Dichotomius, Eurysternus, Ontherus and Uroxys) and failed completely for the genera Pseudocanthon (1 specimen) and Scybalocanthon (6 specimens). The correctly identified species were obtained for species with a wide distribution and shared between South America and Central America, such as Canthon juvencus, Canthon subhyalinus, Eurysternus  [41]. These species are found across countries or ecosystems (e. g. dry and humid forests, natural and disturbed ecosystems). In contrast, the identification was incorrect for some species with restricted distributions in Colombia such as Canthon arcabuquensis, Dichotomius andresi, Uroxys cuprescens, and Deltochilum susanae, the latest was first identified as the morphospecies Deltochilum sp. 12H [42] and later described as a new species [43] or species with Andean distributions as in Dichotomius protectus, Eurysternus marmoreus and Ontherus brevicollis [44][45][46][47][48][49]. In the case of Scarabaeinae dung beetles, the barcode results helped to confirm at least five species that were assigned as morphospecies in the dung beetles of Colombia "Reference Collection" hosted at of IAVH Entomological Collection. The diversity of dung beetle's subfamily Scarabaeinae in Colombia exceeds by a large number of species that of neighboring countries such as Panama, Ecuador or Brazil. In general, a defined and well-identified species in Colombia has other species, morphologically similar and taxonomically closer, that have been assigned as morphospecies. This is why the Reference Collection was created, which so far houses about 85% percent of species of Colombia as morphospecies. We see the barcode as a fundamental integrative taxonomy tool, which will help to outline these morphospecies and define which of those are new species for science, and to improve the final taxonomic list of this important group of beetles. Overall, there are still challenges to overcome and more research is recommended in this topic. The reference databases keep improving and updating their resources and tools for users [23,34]. BOLD provided reliable identifications for the tested groups of insects overcoming challenges such as a small genetic reference of insects for a highly diverse country.

Matches BOLD & Taxonomist
Supporting information S1

S2 Table. Records of insects by orders with matches from BOLD Batch ID engine and BLAST (GenBank). (XLSX)
S3 Table. Chi-squared tests. A) Dataset analyses including BOLD's first suggestion with the highest percentage match and base pair overlap. This dataset includes matches to sequences on our project. B) Alternative dataset analyses including next higher percentage match and base pair overlap. This excludes matches to sequences from our project. *Is next to orders where Chi-square was significant. **Is next to orders where Chi-square analyzed with Bonferroni correction was significant. (XLSX) S4 Table. List of Scarabaeinae in other categories as determined by the taxonomist. (XLSX)